Over the last two years, COVID-19 genomic data have been collected and stored in two central data repositories, GISAID and GenBank.
GISAID is a global science initiative that provides open-access genomic data of influenza viruses and the SARS-CoV-2 virus. With over 6 million submissions, GISAID currently hosts the largest repository for SARS-CoV-2 sequences. Because GISAID data are not in the public domain, GISAID data can only be accessed after creating a username and agreeing to a Database Access Agreement.
GenBank is an annotated collection of publicly available DNA sequences. It is the genetic sequence database of the US NIH. GenBank currently has over 3 million sequence read archive (SRA) runs and over 3.8 million nucleotide records for SARS-CoV-2. This database was created to provide access to the most up-to-date and comprehensive DNA sequence information, therefore there are no restrictions on the use or distribution of this data.
Though GISAID contains more variant data for SARS-CoV-2 globally, for projects interested in looking at data in the US may find that GenBank is sufficient. This is because much US sequencing is conducted through CDC contracts that stipulate a GenBank submission.
There are a number of naming systems in use for labeling variants for the SARS-CoV-2 virus.
Two naming systems relevant for tracking variants are Nextstrain and PANGO.
The Nextstrain naming system for COVID-19 involves labeling major clades by the year they first emerged, and a letter. For example, 19A was the first major clade detected in 2019.
A new major clade is named if it is at least 2 nucleotides away from its parent clade and if it reaches 20% frequency in the global population or for at least 2 months, or if it reaches 30% regional frequency for at least 2 months. A major clade may also be named if it is a variant of concern, even if it has not reached the required population frequency.
To signify clade major lineages with a hierarchical structure, emerging clades may be written as a string of parent clade lineages. For example, 19A.20A.20C for the major clade 20C. In the Nextstrain notation, emerging clades are listed with a parent clade and their defining nucleotide mutation or amino acid mutation. For example, 19A/28688C or 20B/S.484K. To note variants of concern, this nomenclature also labels them with their relevant spike mutation and a version numbering. For example, 20I/501Y.V1
Additional information regarding the original naming naming conventions using the Nextstrain naming structure can be found at the link: https://nextstrain.org/blog/2020-06-02-SARSCoV2-clade-naming and updated nomenclature can be found at the link https://nextstrain.org/blog/2021-01-06-updated-SARS-CoV-2-clade-naming
The rules of Pango lineage naming are complex. Information describing the naming rules for Pango lineages can be found here. In the Pango naming system, lineages are named to signify clusters of infections with shared ancestry and to highlight epidemiologically-relevant events.
For modelers conducting a variant-level analysis, we recommend using Nextstrain naming conventions. For modelers wishing to conduct a sub-variant analysis, Nextstrain naming conventions may not be specific enough and therefore PANGO naming conventions may be useful in these settings.
When looking to access genomic data, raw data is stored in a number of different file types. For variant level analyses, the “metadata” file described below is a flat tab-separated dataset that includes one row per sample and has information such as the sample’s origin, cell line, and preparation method, as well as which variant was identified in the sample.
The FASTA file contains the detailed nucleic acid or amino acid sequence information in a text fie format for for a single sample.
The Nextstrain project makes daily snapshots of GenBank data available for the US and the world. Specifically, the flat tab-separated file available at https://data.nextstrain.org/files/ncov/open/metadata.tsv.gz is updated daily, typically around 11am or noon US Pacific time. This file is large (>350MB as of 2022-02-09). For the below, we assume that you have downloaded this file and unzipped it so the .tsv file can be read in directly.
A codebook for the fields in the dataset are available here.
Let’s start by loading some tidyverse packages that will be useful for us and then by reading in the dataset.
library(tidyverse)
library(lubridate)
theme_set(theme_bw())
genbank_global <- read_tsv("data/metadata.tsv")This is a large file, with 3707198 rows. For starters, we create a version of these data that contain only US data, and only retains columns we are interested in. Further, we will reformat certain columns.
us_dat <- genbank_global %>%
filter(country=="USA") %>%
mutate(date = ymd(date),
date_submitted = ymd(date_submitted),
reporting_lag = as.numeric(date_submitted - date)) %>%
select(strain, virus, Nextstrain_clade, ## info on the virus
region, country, division, location, ## info on location
host, sampling_strategy, ## info about the sample
date, date_submitted, reporting_lag) ## datesus_dat %>%
group_by(Nextstrain_clade) %>%
summarize(n_samples = n()) | Nextstrain_clade | n_samples |
|---|---|
| 19A | 4445 |
| 19B | 3966 |
| 20A | 50509 |
| 20B | 23174 |
| 20C | 57723 |
| 20D | 613 |
| 20E (EU1) | 134 |
| 20G | 62703 |
| 20H (Beta, V2) | 2188 |
| 20I (Alpha, V1) | 184305 |
| 20J (Gamma, V3) | 20702 |
| 21A (Delta) | 40794 |
| 21B (Kappa) | 246 |
| 21C (Epsilon) | 39149 |
| 21D (Eta) | 1025 |
| 21E (Theta) | 20 |
| 21F (Iota) | 32890 |
| 21G (Lambda) | 963 |
| 21H (Mu) | 4970 |
| 21I (Delta) | 89213 |
| 21J (Delta) | 994549 |
| 21K (Omicron) | 223836 |
| 21L (Omicron) | 268 |
| 21M (Omicron) | 24 |
| NA | 194 |
Note that the sampling_strategy field is empty for all US data.
us_dat %>%
group_by(sampling_strategy) %>%
summarize(n_samples = n()) | sampling_strategy | n_samples |
|---|---|
| ? | 1838603 |
The following figure shows the reporting lags by state as computed in the data as the difference between the date the sample was taken and the date_submitted, which is the date the sample was submitted to the GenBank system. There appears to be substantial variation by location (note the shorter lags in MA, CA and VT), with 75% of samples typically reported by 1 month out. Note this is subsetting to look at all data starting in August of 2021. Some analyses, for example the analysis that the Bedford Lab has done, remove all samples from the last 10 days, to try to remove small sample effects in the early reporting.
us_dat %>%
filter(date > ymd("2021-08-01")) %>%
ggplot(aes(x=division, y=reporting_lag)) +
geom_boxplot() +
scale_y_log10(name= "lag (days, log scale)", breaks=c(7, 14, 21, 30, 90, 360, 720)) +
xlab(NULL) +
theme(axis.text.x = element_text(angle=60, vjust=1, hjust=1))Plot of samples over time
us_dat %>%
group_by(date) %>%
summarize(n_samples = n()) %>%
ggplot(aes(x=date, y=n_samples)) +
geom_bar(alpha=.3, stat="identity") +
geom_smooth(span=.1, se=FALSE)Here is a plot of the prevalence of each clade by week across 2021.
strains_of_interest <-
c("20I (Alpha, V1)",
"20J (Gamma, V3)",
"21A (Delta)",
"21C (Epsilon)",
"21I (Delta)",
"21J (Delta)",
"21K (Omicron)"
)
by_clade <- us_dat %>%
filter(year(date) == 2021) %>% ## focus only on 2021
filter(Nextstrain_clade %in% strains_of_interest) %>%
mutate(epiweek = epiweek(date)) %>%
filter(epiweek < 53) %>% ## values of 53 are at the start of 2021
group_by(Nextstrain_clade, epiweek) %>%
summarize(clade_total = n()) %>%
group_by(epiweek) %>%
mutate(total_samples = sum(clade_total),
pct_clade_in_epiweek = clade_total/total_samples)
p <- by_clade %>%
ggplot(aes(x=epiweek, y=pct_clade_in_epiweek, color=Nextstrain_clade)) +
geom_line()
plotly::ggplotly(p)